Read me for: DeLuca, Kevin and John A. Curiel. "Validating the Applicability of Bayesian Inference with Surname and Geocoding to Congressional Redistricting."

This replication directory contains the scripts and data needed to compute the results of our paper that do not rely on personal identifying information (pii). The files are divided into the following: code files, data files, and results. Please direct any questions about these files to John A. Curiel at jcuriel@mit.edu. 

NOTE: The personal identifying information (pii) has been removed from these replication files. This includes the coordinate data from the Georgia voterfile, necessary for BISG using the wru package. These estimates are read in from the confidential pii related script.  


For the purposes of replication, the non-pii scripts were run on a desktop computer with the following specifications: Intel(R) Core(TM) i7-9700 CPU 6-Core/12-Thread, 12MB Cache, 3.0 GHz, 64GB DDR4 Memory XMP at 2933MHz, 2TB HDD.

Note: For the replication code to work, download all the data files from the dataverse in their original file formats (e.g., RData files for the files marked below with that extension).
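
As a minimal sketch of why the original formats matter: .rds files are read with readRDS() and return a single object, while .Rdata files are read with load() and restore objects under the names they were saved with. The helper name read_replication_file is hypothetical and is not part of the replication scripts; the demonstration uses temporary files rather than the actual data.

```r
# Hypothetical helper (not part of the replication scripts): dispatch on the
# file extension so each file is read with the matching base R function.
read_replication_file <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext == "rds") {
    readRDS(path)                          # returns the single stored object
  } else if (ext == "rdata") {
    env <- new.env()
    load(path, envir = env)                # restores objects under saved names
    as.list(env)                           # named list of the restored objects
  } else if (ext == "csv") {
    read.csv(path, stringsAsFactors = FALSE)
  } else {
    stop("Unsupported file extension: ", ext)
  }
}

# Self-contained demonstration using temporary files.
tmp_rds <- tempfile(fileext = ".rds")
saveRDS(data.frame(x = 1:3), tmp_rds)
demo_rds <- read_replication_file(tmp_rds)

tmp_rdata <- tempfile(fileext = ".Rdata")
demo_obj <- c(a = 1, b = 2)
save(demo_obj, file = tmp_rdata)
demo_rdata <- read_replication_file(tmp_rdata)
```

A file converted to another format (for example, an .Rdata file re-saved as .csv) would no longer restore the object names the scripts expect.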


Code files:

simulation_redist_nc.R/simulation_redist_ga.R: These files compute the BISG predictions from the North Carolina and Georgia voter files, respectively, in addition to running the redistricting simulations and all other analyses within the manuscript. The estimated time to complete using the aforementioned computer is approximately 1.57 hours and 53.77 minutes, respectively. R version: 4.0.2. 
Packages read in: redist (v 2.0.2), foreign (v 0.8-80), tidyr (v 1.1.1), dplyr (v 1.0.1), sp (v 1.4-2), raster (v 3.3.13), rgdal (v 1.5-15), rgeos (v 0.5.3), reshape2 (v 1.4.4), stringi (v 1.4.6), stringr (v 1.4.0), splitstackshape (v 1.4.8), data.table (v 1.13.0), ggplot2 (v 3.3.2), scales (v 1.1.1), wru (v 0.1-9), devtools (v 2.3.2), zipWRUext2 (v 0.0.0.9000)


NOTE: The arealOverlap package is currently available only from GitHub and must be installed with devtools, as follows: 
devtools::install_github("https://github.com/jcuriel-unc/arealOverlap2",subdir="arealOverlap")
library(arealOverlap)

NOTE: The zipWRUext2 package is currently available only from GitHub and must be installed with devtools, as follows: 
devtools::install_github("https://github.com/jcuriel-unc/zipWRUext",subdir="zipWRUext2")
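
Because results may differ across package versions, it can help to compare the installed versions against those listed above before running the scripts. The following is a minimal sketch; the helper name check_versions is hypothetical, and only a subset of the listed packages is shown.

```r
# Versions as listed in this read me (subset shown for brevity).
expected <- c(redist = "2.0.2", dplyr = "1.0.1", wru = "0.1-9")

# Hypothetical helper: report installed versus listed versions.
check_versions <- function(expected) {
  installed <- vapply(names(expected), function(pkg) {
    # system.file() returns "" when the package is not installed.
    if (nzchar(system.file(package = pkg))) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1))
  matches <- vapply(seq_along(expected), function(i) {
    # package_version() treats "0.1-9" and "0.1.9" as the same version.
    !is.na(installed[[i]]) &&
      package_version(installed[[i]]) == package_version(expected[[i]])
  }, logical(1))
  data.frame(package = names(expected), expected = unname(expected),
             installed = unname(installed), matches = matches,
             stringsAsFactors = FALSE)
}

version_report <- check_versions(expected)
print(version_report)
```

Packages that are not installed appear with an NA in the installed column instead of raising an error.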



Input folders/files 

county-fips-codes.csv: A csv of the county names and fips codes within the U.S., used as a crosswalk for wru predictions for voter list entries where county level info must be used in place of alternatives. 

cenesus_nc_blockstats.rds: An R list object of North Carolina's demographic data by block as produced by the wru get_census_data command. 

cenesus_ga_blockstats.rds: An R list object of Georgia's demographic data by block as produced by the wru get_census_data command.

shpfiles: a folder of the necessary voter files, in addition to precinct, congressional, and census block shapefiles; consists of 

	congress/cb_2018_us_cd116_500k (.shp, .dbf, .prj, .shx): The congressional district shapefile from the U.S. Census TIGER Shapefiles for the 116th Congress. 

	SBE_PRECINCTS_09012012 (.shp, .dbf, .prj, .shx): The North Carolina precinct shapefiles from the North Carolina State Board of Elections (NCSBE) from September 1, 2012. https://dl.ncsbe.gov/?prefix=PrecinctMaps/

	nc_precinct/import_vr/vr2012snapshot.rds: The snapshot of the NC voter list from 2012, from the NCSBE. 

	nc_precinct_cd_edits4 (.shp, .dbf, .prj, .shx): The North Carolina congressional district basemap created by Curiel and Steelman (2018) in their Election Law Journal publication.

	ga_precinct_cd_edits (.shp, .dbf, .prj, .shx): The Georgia district basemap, derived from the Georgia precinct shapefile and the Congressional districts in place. 

	ga_precinct/GAvoterfilesmall_anon.csv: An anonymized version of the GA voterfile purchased from the Secretary of State of Georgia; includes the zip code, last name, and race fields necessary to run and validate BISG. 

	ga_precinct/ga_voterfile_by_precinct.rds: The Georgia voter residences overlaid onto precincts. 


Intermediary files 

	nc_voter_reg_cleanedwbisg (.rds/.csv): The saved cleaned voter list with BISG predictions by estimated probabilities and plurality, in addition to the race recorded by the state. 6,649,049 rows and 37 columns. Fields of interest are as follows.
	county_desc - Upper case county name where voter is registered. Source: NCSBE. 
	zcta5 - The five digit zip code of the voter's residential address. Source: NCSBE.
	voter_status_desc - The voter's registration status, either active or inactive. Source: NCSBE
	surname - The voter's last name, used for the BISG estimations. Source: NCSBE.
	race_code - The abbreviated one letter racial code of the voter's self reported race. Source: NCSBE.
	race_desc - The fully specified self reported race of the voter. Source: NCSBE.
	ethnic_desc - Self reported ethnicity of whether the voter is Hispanic. Source: NCSBE.
	party_desc - The voter's party registration. Source: NCSBE.
	precinct_abbrv - The numeric code of the precinct where the voter is registered. Source: NCSBE.
	vtd_abbrv/vtd_desc - Ibid. 
	precinct_desc - The non-numeric code of the precinct where the voter is registered. Source: NCSBE.
	cong_dist_abbrv - The string padded congressional district number where the voter is registered. Source: NCSBE.
	white - a dummy variable as to whether the voter's self reported race is white. 
	black - a dummy variable as to whether the voter's self reported race is black.
	other_race - a dummy variable as to whether the voter's self reported race is other than white or black.
	democrat - a dummy variable as to whether the voter's registration is democratic.
	republican - a dummy variable as to whether the voter's registration is republican.
	other_party -  a dummy variable as to whether the voter's registration is other than democratic or republican
	surname.match - The text field of the matched surname from the wru surname dictionary. If unmatched, the voter's surname was not successfully matched. 
	pred.(whi/bla/his/asi/oth) - The predicted probability that a voter is of the mutually exclusive races of white, black, hispanic, asian, or other, as determined using the wru and zipWRUext2 packages.
	county_fips - The five digit county FIPS code. 
	County - The three digit county FIPS code (the 2 digit state FIPS removed).
	herf_weight - The Herfindahl index for the diversity of predicted racial categories. Scores approaching one indicate that effectively only one race is predicted; scores closer to zero indicate complete uncertainty, as with a near infinite effective number of races predicted. 
	max_race_prob - The probability that an individual is their estimated plurality race. 
	plural_race - The voter's estimated plurality racial category of either white, black, hispanic, asian, or other. 
	white_plural - A dummy variable as to whether the BISG plurality estimated race of the voter is white.
	black_plural - A dummy variable as to whether the BISG plurality estimated race of the voter is black.
	other_plural - A dummy variable as to whether the BISG plurality estimated race of the voter is other.
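
The derived fields herf_weight, max_race_prob, plural_race, and the plurality dummies can be sketched as follows. This is a minimal illustration on made-up probabilities, not the code from the replication scripts. It assumes the standard Herfindahl definition (the sum of squared predicted probabilities across the five categories) and assumes other_plural flags any plurality category other than white or black; the scripts may define either differently.

```r
# Toy data: made-up predicted probabilities for three voters.
voters <- data.frame(
  pred.whi = c(0.90, 0.20, 0.30),
  pred.bla = c(0.05, 0.20, 0.50),
  pred.his = c(0.02, 0.20, 0.10),
  pred.asi = c(0.02, 0.20, 0.05),
  pred.oth = c(0.01, 0.20, 0.05)
)
pred_cols   <- c("pred.whi", "pred.bla", "pred.his", "pred.asi", "pred.oth")
race_labels <- c("white", "black", "hispanic", "asian", "other")
probs <- as.matrix(voters[, pred_cols])

# ASSUMPTION: Herfindahl index as the sum of squared probabilities per voter.
voters$herf_weight <- rowSums(probs^2)

# Probability of the most likely category, and that category's label.
# Ties are broken in favor of the first category listed (an arbitrary choice).
voters$max_race_prob <- apply(probs, 1, max)
voters$plural_race   <- race_labels[max.col(probs, ties.method = "first")]

# Plurality dummies. ASSUMPTION: "other" means neither white nor black.
voters$white_plural <- as.integer(voters$plural_race == "white")
voters$black_plural <- as.integer(voters$plural_race == "black")
voters$other_plural <- as.integer(!voters$plural_race %in% c("white", "black"))
```

Under this assumed definition and five categories, the lowest attainable value is 0.2, which occurs when all categories are equally likely (the second toy voter).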

	ga_voter_reg_cleanedwbisg (.rds/.csv): The saved cleaned voter list with BISG predictions by estimated probabilities and plurality, in addition to the race recorded by the state. 7,344,555 rows and 32 columns. Fields of interest are the same as the above for nc_voter_reg_cleanedwbisg. 

	nc_precincts_demos.rds - The North Carolina precinct spatial file from the NCSBE, with the census block demographics overlaid. A total of 2,746 polygons and 7 columns. The fields of interest are as follows:

	COUNTY_NAM - Upper case county name
	PREC_ID - The numeric code of the precinct.
	ENR_DESC - The non-numeric code of the precinct.
	white - The number of white individuals as estimated from 2010 Census block info
	black - The number of black individuals as estimated from 2010 Census block info
	other_races - The number of individuals of a race that is not black or white, as estimated from 2010 Census block info

	nc_prec2merged.rds - The R spatial data of precincts (nc_precincts_demos.rds) with the precinct-summarized data from the nc_voter_reg_cleanedwbisg merged on. These include all of the relevant race/ethnicity fields in addition to:

		total - The total number of registered voters 
		cb_total - The total number of individuals, as estimated from the U.S. Census. 
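
The precinct-level summary of the voter file can be sketched as below. This is a minimal illustration on made-up data, assuming voters are grouped by precinct_abbrv; it shows the two BISG aggregation approaches named in this read me, probability summing (summing pred.bla) and plurality assignment (summing black_plural), next to the state-recorded count.

```r
# Toy voter-level data: made-up values for five voters in two precincts.
voters <- data.frame(
  precinct_abbrv = c("01", "01", "01", "02", "02"),
  black          = c(1, 0, 0, 1, 1),            # race recorded by the state
  pred.bla       = c(0.8, 0.3, 0.1, 0.6, 0.4),  # BISG predicted probability
  black_plural   = c(1, 0, 0, 1, 0)             # BISG plurality assignment
)
voters$total <- 1  # each row counts as one registered voter

# Sum every measure within each precinct.
precinct_sums <- aggregate(
  cbind(total, black, pred.bla, black_plural) ~ precinct_abbrv,
  data = voters, FUN = sum
)
print(precinct_sums)
```

In this toy example the two BISG approaches agree in precinct 02 (one expected Black voter each) but differ in precinct 01 (1.2 under probability summing, 1 under plurality assignment).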



  
Output folders/files 

nc_output: A repository folder for the outputted redistricting simulation data for NC 

	alg_census_store_loopn.Rdata - Files with the saved output of the nth simulation from the redist package simulations for NC. 

	alg_census_store.Rdata - File with the saved output from the redist package simulations for NC.

	mpsa_least2most_blackprec_dist.Rdata - A dataframe with 13 columns reflecting the 13 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as black, according to the NCSBE voter file. Organized from least to most Black districts. 

	mpsa_least2most_blackbisg_dist.Rdata - A dataframe with 13 columns reflecting the 13 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as black, according to the BISG of the NCSBE voterfile, using probability summing. Organized from least to most Black district. 

	mpsa_least2most_blackbisg_plural_dist.Rdata - A dataframe with 13 columns reflecting the 13 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as black, according to the BISG of the NCSBE voterfile, using plurality assignment. Organized from least to most Black district. 

	mpsa_least2most_whiteprec_dist.Rdata - A dataframe with 13 columns reflecting the 13 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as white, according to the NCSBE voter file. Organized from least to most White districts. 

	mpsa_least2most_whitebisg_dist.Rdata - A dataframe with 13 columns reflecting the 13 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as white, according to the BISG of the NCSBE voterfile, using probability summing. Organized from least to most White district.

	mpsa_least2most_whitebisg_plural_dist.Rdata - A dataframe with 13 columns reflecting the 13 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as white, according to the BISG of the NCSBE voterfile, using plurality assignment. Organized from least to most White district. 
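
The "least to most" layout of these dataframes can be sketched as follows: for each simulated plan, compute the percent Black in every district, then sort the districts in ascending order so that column k holds the kth least Black district of that plan. This is a minimal illustration on made-up precinct counts with three districts. The plans matrix is a hypothetical stand-in and does not reflect the actual output format of the redist package.

```r
# Toy precinct-level counts (made-up values for six precincts).
black_count <- c(10, 40, 25, 5, 30, 20)
total_count <- c(100, 100, 50, 50, 100, 100)

# Hypothetical district assignments: one row per simulated plan,
# one column per precinct, values are district numbers.
plans <- rbind(c(1, 1, 2, 2, 3, 3),
               c(1, 2, 2, 3, 3, 1))

# For each plan, compute district percentages and sort them ascending.
least2most <- t(apply(plans, 1, function(assign) {
  pct <- tapply(black_count, assign, sum) /
    tapply(total_count, assign, sum) * 100
  sort(as.numeric(pct))
}))
least2most <- as.data.frame(least2most)
names(least2most) <- paste0("rank_", seq_len(ncol(least2most)))
print(least2most)
```

Because each row is sorted independently, a given column does not correspond to a fixed geographic district across simulations.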


ga_output: A repository folder for the outputted redistricting simulation data for GA

	alg_census_store_ga_loopn.Rdata - Files with the saved output of the nth simulation from the redist package simulations for GA. 

	alg_census_store_ga.Rdata - File with the saved output from the redist package simulations for GA.

	mpsa_least2most_blackprec_dist_ga.rds - A dataframe with 14 columns reflecting the 14 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as black, according to the Georgia voter file. Organized from least to most Black districts. 

	mpsa_least2most_blackbisg_dist_ga.rds - A dataframe with 14 columns reflecting the 14 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as black, according to the BISG of the Georgia voter file, using probability summing. Organized from least to most Black district. 

	mpsa_least2most_blackbisg_plural_dist_ga.rds - A dataframe with 14 columns reflecting the 14 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as black, according to the BISG of the Georgia voter file, using plurality assignment. Organized from least to most Black district. 

	mpsa_least2most_whiteprec_dist_ga.rds - A dataframe with 14 columns reflecting the 14 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as white, according to the Georgia voter file. Organized from least to most White districts. 

	mpsa_least2most_whitebisg_dist_ga.rds - A dataframe with 14 columns reflecting the 14 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as white, according to the BISG of the Georgia voter file, using probability summing. Organized from least to most White district.

	mpsa_least2most_whitebisg_plural_dist_ga.rds - A dataframe with 14 columns reflecting the 14 districts, rows reflecting each iteration of a simulation, and the cells reflecting the percent of the population registered as white, according to the BISG of the Georgia voter file, using plurality assignment. Organized from least to most White district. 




plots: A repository folder of the outputted plots

Main figures 

	white_density_error_plot_nc.png - Figure 1a, produced from the simulation_redist_nc.R script

	white_density_error_plot_ga.png - Figure 1b, produced from the simulation_redist_ga.R script

	black_density_error_plot_nc.png - Figure 1c, produced from the simulation_redist_nc.R script

	black_density_error_plot_ga.png - Figure 1d, produced from the simulation_redist_ga.R script

	white_bisg_error_sensitivity_nc.jpg - Figure 2a, produced from the simulation_redist_nc.R script

	white_bisg_error_sensitivity_ga.jpg - Figure 2b, produced from the simulation_redist_ga.R script

	black_bisg_error_sensitivity_nc.jpg - Figure 2c, produced from the simulation_redist_nc.R script

	black_bisg_error_sensitivity_ga.jpg - Figure 2d, produced from the simulation_redist_ga.R script


Supplementary material figures 

	nc_herfindahl_density.png - Figure S2(a), produced from the simulation_redist_nc.R script. 

	ga_herfindahl_density.png - Figure S2(b), produced from the simulation_redist_ga.R script. 

	nc_err_b_white_herf.png - Figure S3, top left panel, produced from simulation_redist_nc.R script.

	nc_err_b_black_herf.png - Figure S3, bottom left panel, produced from simulation_redist_nc.R script.

	nc_err_p_white_herf.png - Figure S3, top right panel, produced from simulation_redist_nc.R script.

	nc_err_p_black_herf.png - Figure S3, bottom right panel, produced from simulation_redist_nc.R script.

	ga_err_b_white_herf.png - Figure S4, top left panel, produced from simulation_redist_ga.R script.

	ga_err_b_black_herf.png - Figure S4, bottom left panel, produced from simulation_redist_ga.R script.

	ga_err_p_white_herf.png - Figure S4, top right panel, produced from simulation_redist_ga.R script.

	ga_err_p_black_herf.png - Figure S4, bottom right panel, produced from simulation_redist_ga.R script.

	
